Goto

Collaborating Authors

 human motion


SnapMoGen: Human Motion Generation from Expressive Texts

Neural Information Processing Systems

Text-to-motion generation has experienced remarkable progress in recent years. However, current approaches remain limited to synthesizing motion from short or general text prompts, primarily due to dataset constraints. This limitation undermines fine-grained controllability and generalization to unseen prompts. In this paper, we introduce SnapMoGen, a new text-motion dataset featuring highquality motion capture data paired with accurate, expressive textual annotations. The dataset comprises 20K motion clips totaling 44 hours, accompanied by 122K detailed textual descriptions averaging 48 words per description (vs.


HMVLM: Human Motion-Vision-Lanuage Model via MoELoRA

Neural Information Processing Systems

The expansion of instruction-tuning data has enabled foundation language models to exhibit improved instruction adherence and superior performance across diverse downstream tasks. Semantically-rich 3D human motion is being progressively integrated with these foundation models to enhance multimodal understanding and cross-modal generation capabilities. However, the modality gap between human motion and text raises unresolved concerns about catastrophic forgetting during this integration. In addition, developing autoregressive-compatible pose representations that preserve generalizability across heterogeneous downstream tasks remains a critical technical barrier. To address these issues, we propose the Human MotionVision-Language Model (HMVLM), a unified framework based on the Mixture of Expert Low-Rank Adaption(MoE LoRA) strategy. The framework leverages the gating network to dynamically allocate LoRA expert weights based on the input prompt, enabling synchronized fine-tuning of multiple tasks. To mitigate catastrophic forgetting during instruction-tuning, we introduce a novel zero expert that preserves the pre-trained parameters for general linguistic tasks. For pose representation, we implement body-part-specific tokenization by partitioning the human body into different joint groups, enhancing the spatial resolution of the representation. Experiments show that our method effectively alleviates knowledge forgetting during instruction-tuning and achieves remarkable performance across diverse human motion downstream tasks.


🎧MOSPA: Human Motion Generation Driven by Spatial Audio

Neural Information Processing Systems

Enabling virtual humans to dynamically and realistically respond to diverse auditory stimuli remains a key challenge in character animation, demanding the integration of perceptual modeling and motion synthesis. Despite its significance, this task remains largely unexplored. Most previous works have primarily focused on mapping modalities like speech, audio, and music to generate human motion. As of yet, these models typically overlook the impact of spatial features encoded in spatial audio signals on human motion. To bridge this gap and enable high-quality modeling of human movements in response to spatial audio, we introduce the first comprehensive Spatial Audio-Driven Human Motion (SAM) dataset, which contains diverse and high-quality spatial audio and motion data. For benchmarking, we develop a simple yet effective diffusion-based generative framework for human MOtion generation driven by SPatial Audio, termed MOSPA, which faithfully captures the relationship between body motion and spatial audio through an effective fusion mechanism. Once trained, MOSPA can generate diverse realistic human motions conditioned on varying spatial audio inputs. We perform a thorough investigation of the proposed dataset and conduct extensive experiments for benchmarking, where our method achieves state-of-the-art performance on this task.


BEDLAM2.0: Synthetic humans and cameras in motion

Neural Information Processing Systems

Inferring 3D human motion from video remains a challenging problem with many applications. While traditional methods estimate the human in image coordinates, many applications require human motion to be estimated in world coordinates. This is particularly challenging when there is both human and camera motion. Progress on this topic has been limited by the lack of rich video data with ground truth human and camera movement. We address this with BEDLAM2.0, a new dataset that goes beyond the popular BEDLAM dataset in important ways.


Harmonizing Stochasticity and Determinism: Scene-responsive Diverse Human Motion Prediction

Neural Information Processing Systems

Diverse human motion prediction (HMP) is a fundamental application in computer vision that has recently attracted considerable interest. Prior methods primarily focus on the stochastic nature of human motion, while neglecting the specific impact of external environment, leading to the pronounced artifacts in prediction when applied to real-world scenarios. To fill this gap, this work introduces a novel task: predicting diverse human motion within real-world 3D scenes. In contrast to prior works, it requires harmonizing the deterministic constraints imposed by the surrounding 3D scenes with the stochastic aspect of human motion. For this purpose, we propose DiMoP3D, a diverse motion prediction framework with 3D scene awareness, which leverages the 3D point cloud and observed sequence to generate diverse and high-fidelity predictions. DiMoP3D is able to comprehend the 3D scene, and determines the probable target objects and their desired interactive pose based on the historical motion. Then, it plans the obstacle-free trajectory towards these interested objects, and generates diverse and physically-consistent future motions. On top of that, DiMoP3D identifies deterministic factors in the scene and integrates them into the stochastic modeling, making the diverse HMP in realistic scenes become a controllable stochastic generation process. On two real-captured benchmarks, DiMoP3D has demonstrated significant improvements over state-of-the-art methods, showcasing its effectiveness in generating diverse and physically-consistent motion predictions within real-world 3D environments.